Space-efficient computation of k-mer dictionaries for large values of k

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task but they are often optimised for small k as a hash table keeping keys explicitly (i.e., k-mer sequences) takes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(N\frac{k}{w})$$\end{document}O(Nkw) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for long values of k. This space usage is an important limitation as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(N+u\frac{k}{w})$$\end{document}O(N+ukw) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-1$$\end{document}k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-1$$\end{document}k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(\sigma ^{k})$$\end{document}O(σk) time in the worst case, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma$$\end{document}σ being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping a competitive performance to get the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.


Introduction
Strings of length k called k-mers are central in many genomic analyses.While in the past, relatively short k-mers with k less than 100 have been used to analyse short read data produced by Illumina sequencers, the development of long and very low error rate sequencing technologies such as HiFi sequencing by Pacific Biosciences has created a need to support longer k-mers to take advantage of the length of the reads.For example, when assembling this data to whole genomes using de Bruijn graphs, assemblers use k-mers with k up to several thousand base pairs to decrease branching in the de Bruijn graph [2,25].A hash table associating k-mers to values such as the frequency of a k-mer is a basic data structure used in the analysis of k-mers.This work presents a space-efficient hash table for long k-mers.
The use of k-mers is common in many tasks in the analysis of sequencing data including error correction [7,13], genome assembly [11,21], genetic variant calling [20,30], metagenomic classification [32], and repeat analysis [6,15].The first step in most k-mer-based methods is counting [17], where the aim is to compute the frequencies of all k-mers occurring in a sequencing read set.One approach to k-mer counting, taken e.g. by Jellyfish [18], DSK [26], and CHTKC [31], is to use a hash table to associate the k-mers with their frequencies.In addition to counting, k-mer hash tables are useful in applications where k-mers have to be associated with values and random access to the k-mers is needed.
The hash tables used by tools like Jellyfish [18] and CHTKC [31] are highly optimised for short k-mers.A kmer of DNA can be encoded in 2k bits and thus the kmer can be stored in each hash table entry when k ≤ 32 , i.e. the sequence fits the computer word of modern computer architectures.However, when k becomes large, this approach is space-consuming as the space complexity is O(N k w ) words, where N is the number of unique k-mers and w is the computer word size.For example, in de-Bruijn-graph-based genome assemblers for HiFi reads, it is crucial to use a large value of k to take advantage of the read lengths and decrease branching in the de Bruijn graph.However, due to the memory bottleneck caused by long k-mers, these tools resort to various ways to avoid creating a dictionary of all k-mers in the read set.For instance, MBG (Minimizer-based sparse de Bruijn Graph) [25] uses minimizer winnowing [27] to choose a subset of the k-mers and then builds a sparse de Bruijn graph using this subset.
LJA (La Jolla Assembler) [2] uses a complex procedure to directly construct the compressed de Bruijn graph, which is a memory-efficient version of a de Bruijn graph where all nonbranching paths have been compressed.LJA first builds a sparse de Bruijn graph using minimizers similar to MBG.The sparse de Bruijn graph is then used to create a set of disjointigs, a set of strings containing all k-mers of the original read set as substrings.Then a Bloom-filter is used to record all k-mers present in the disjointigs and finally the compressed de Bruijn graph is built.
We propose Kaarme,1 a space-efficient hash table for k-mers that uses O(N + u k w ) words of space, where u is the number of strings in the input data set.Kaarme is an in-memory data structure that stores for each hash table entry the last symbol of the k-mer, a pointer to a previous k-mer, and some bookkeeping bits.A key insight in Kaarme is that a common pattern in many applications is to access the entries so that each k-mer overlaps by k − 1 symbols with the previous one.This allows for fast detection of collisions in our scheme in most cases as it suffices to check that the pointer in the hash table entry matches the pointer of the previous k-mer and that the last symbol of the k-mer matches the symbol saved in the hash table entry.Additionally, we show how the hash table can be adapted to store only canonical k-mers, i.e. k-mers that are smaller than their reverse complements in some predefined order, and we give a lockfree parallel implementation of the hash table and its construction.
We compare canonical Kaarme against a regular hash table storing the whole k-mer's canonical sequence in each entry and show that Kaarme uses up to five times less space than a regular hash table while being at most 1.5 times slower.Additionally, we compare canonical Kaarme against k-mer counters that rely on hash tables and show that, for data sets where the k-mer counters do not resort to disk space, Kaarme uses significantly less RAM.

Related work
The problem of constructing a compact representation for an input set of k-mers has been studied before.These solutions typically also take advantage of the k-mers sharing long overlaps with each other.BOSS [4] uses a data structure that resembles the Burrows-Wheeler transform to represent a de Bruijn graph, i.e. a set of k-mers.UST [24] and ProphASM [5] find a small spectrum preserving string set (SPSS) which is a set of strings containing all k-mers in the input k-mer set and Eulertigs [28] gives a minimum SPSS.Shibuya et al. [29] consider compressing a k-mer count table using compressed static functions and minimizers.Finally, Pibiri [22] and Pibiri et al. [23] consider the problem of associating each k-mer in a k-mer set with a unique integer in the range [1 . . . n] where n is the number of k-mers.However, all these tools require that the set of k-mers has already been counted, which is exactly what Kaarme hash table is designed for.
Bifrost [10] can directly compute the compacted de Bruijn graph from input reads.However, it outputs the compacted de Bruijn graph, which does not allow values to be associated with single k-mers like our hash table.Methods such as TwoPaCo [19] and Cuttlefish [14] compute a de Bruijn graph for genomic sequences.These tools exploit the length of the sequences and the fact that all k-mers will be kept in the final data structure, unlike when working with reads, where k-mers with low counts are often discarded.
Some k-mer counters [8,9,16] exploit the overlaps between k-mers by using super k-mers and minimizers.A minimizer of a string is the smallest substring of a given length in some predefined order, and a super k-mer is a string consisting of consecutive k-mers that share the same minimizer.These k-mer counters first split the reads into super k-mers, and the super k-mers are then partitioned into bins so that the super k-mers sharing the same minimizer end up in the same bin.Since all k-mers with the same minimizer are now in the same bin, different bins do not share any k-mers and can thus be counted independently, allowing for efficient parallelization.

De Bruijn graphs
The order k de Bruijn graph (dBG) G = (V , E) of a string S[1 . . .n] over the alphabet is a labelled directed graph that encodes the distinct k-length substrings of S. Each node v ∈ V is labelled with one of the distinct k-length substrings.Thus, if S has N distinct k-length substrings, G has N nodes.Additionally, two nodes u and v are connected by an edge (u, v) ∈ E if the k − 1-length suffix of label(u) matches the k − 1-length prefix of label(v), and the collapse of the overlap yields a k + 1-length sequence that exists as a substring in S. The label of (u, v) is the rightmost symbol of label(v).

String fingerprints and rolling hashing
The Karp-Rabin method [12] computes fingerprints (integer values) for strings of arbitrary size.Given a sequence K [1 . . .k] over the alphabet , a fingerprint function h p : � k → [1 . . .p] is defined as where q ∈ [1 . . .p] is an integer chosen uniformly at ran- dom and p is a prime number.Karp-Rabin fingerprints can be updated in constant time.Given 1) time (among other opera- tions).The fast update makes Karp-Rabin fingerprints the preferred solution to implement rolling hashing in linear time.That is, given a string T [1 . . .n] and an inte- To insert the strings visited by the rolling hash into a hash table of size m, it is convenient to combine h p with a universal hash function with m < p .By using a large prime p, the probability for two random strings over k to be assigned the same integer in [1 . . .m] is close to 1/m.

Definitions
We consider the RAM model of computation.Given an input of n symbols, we assume our procedures run in random-access memory, where the machine words are w = �(log n) bits in length and can be manipulated in constant time.
Let {a, c, g, t} be the DNA alphabet, and let = {0, 1, 2, 3, 4} be another set of size σ = |�| = 5 to which we map the DNA alphabet as a = 1 , c = 2 , g = 3 , t = 4 .For technical reasons, we define ε = 0 ∈ � as the empty symbol.Additionally, we regard the DNA complement as a permutation π[1, σ ] that reorders the symbols in , exchanging 1 = a with 4 = t , 2 = c with 3 = g , and keeping the same value π[ε] = ε for the empty symbol.The reverse complement of R ∈ * , denoted R , is the string formed by reversing R and replacing every symbol We consider a collection R = {R 1 , . . ., R u } of u strings over the alphabet * , with a total length of n = ||R|| = u i=1 |R i | symbols.Let h p : � k → [0 . . .p − 1] be a function that maps k-length strings to integers in [0 . . .p − 1] uniformly at random, where p is a prime number.We define the canonical version of a k-mer K, denoted K c , to be the string K c ∈ {K , K } with the smallest value in h p .If h p assigns the same integer to K and K , we set K c equal to the smallest string in lexicographical order between K and K.
A k-mer dictionary D k,R is a dictionary where the keys are the distinct k-length substrings of R (i.e., the k-mers), and their associated values are the frequencies of those k-mers in R .The canonical k-mer dictionary D c k,R stores as keys the canonical k-mers of R .The value associated with each key K c in D c k,R is the sum of the frequencies in R for K and K.
Our framework consists of three algorithms: 1. GetDict(R, k) ("Building the dictionary" section): recieves R as input and returns a hash We assume probe(h(K), m, d) computes the bucket using quadratic probing.

Dictionary of k-mers
We begin by describing how to build the k-mer dictionary D k,R efficiently."Our compact hash table" section presents our compact data structure that exploits the redundancy of consecutive k-mer in R .Then, in "Build- ing the dictionary" section, we explain our algorithm GetDict, which builds D k,R using this compact data structure."Connection with de Bruijn graphs" section describes the link between our method and the de Bruijn graph of R .Finally, in "Space and time complexity" sec- tion, we present the space and time complexities of Get-Dict, and in "Improving the time complexity" section, we show how to improve its running time.

Our compact hash table
We devise a compact hash table H where the keys are the distinct k-mers of R , and the associated val- ues are their frequencies.We reduce space usage by exploiting the fact that k-mers occurring consecutively in R contain redundant information.Specifi- cally, let be two consecutive substrings in R i ∈ R , with k = y − x + 1 and K pr = K .Our simple observa- tion is that, storing K pr and K explicitly in H pro- duces a redundant copy of the In our encoding, a bucket H[b] stores a k-mer K in relative form along with its frequency in R .This repre- sentation has three fields H[b] = (f , r, a) , where f is the frequency of K in R , r is another bucket in H where we can recover the prefix K [1 . . .k − 1] and a = K [k] ∈ is the rightmost symbol in K.We refer to H[b].r = b pr as the reference bucket for K.We also keep a dynamic buffer B to store the k-mers that we cannot encode immediately in relative form.This situation occurs when, during the execution of GetDict, we visit the kth prefix

Building the dictionary
We now describe GetDict, our method to compute the dictionary D k,R that relies on the compact hash table H of "Our compact hash table" section.Algorithm 4.2.2 contains more details about its implementation.
We start the algorithm by defining a rolling hash function h : � k → [1 . . .m] that maps k-mers to buckets in H uniformly at random as described in "Preliminaries" section, and by creating an empty dynamic buffer B. Subsequently, we scan R from left to right, and for every k-mer K = R[x . . .y] we visit, with R ∈ R , we call the operation incval(H, K, 1) ("Definitions" section).Lines 28-33 depict this idea.
Before describing how incval works, we will introduce some useful notation.Let K = R[x . . .y] , with y − x + 1 = k , be the active window in the scan of R during the execution of GetDict.Additionally, let b pr be the bucket in H where we inserted the predecessor k-mer K pr = R[x − 1 . . .y − 1] (i.e., the reference bucket of K).We assume b pr is null if K is the kth prefix of R. For convenience, we also change the signature of incval to incval(H, K , 1, b pr ) = b , where b is the bucket where K resides.
When we call incval(H , K , 1, b pr ) , we probe buckets in H using the function probe ("Definitions" section) until we find one that either is empty or has a key that matches K. Every time we probe a non-empty bucket H[b ′ ] , we check if K matches its key in O(k) time by following bucket references recursively.We refer to this procedure as keycomp: The last aspect we will discuss in this section is the implementation of keycomp(H, b ′ , K , b pr ) (Lines 1-9).
Figure 1 shows an example of the compact hash table H we obtain with GetDict.

Connection with de Bruijn graphs
It is not difficult to see that our compact hash table resembles the de Bruijn graph.Let G = (V , E) be the de Bruijn graph of order k obtained from R .The pair (H, B) encodes a graph

and the field H[b].r represents an edge connecting v with one of its incoming nodes
G ′ is a sparse version of G because some edges of E might not be present in E ′ since every bucket H[b] stores at most one incoming edge for v, and the remaining ones are ignored to save space.We remark that (H, B) offers limited navigational functionality for G ′ : each node v can only visit its incoming node u (via H[b].r) and retrieve the symbol H [b].a that labels (u, v) ∈ E ′ .Still, this feature is enough for us to implement keycomp during the probing phase of incval.

Space and time complexity
We now describe the upper bounds of our framework.We will refine these results in the following sections.Proof The rolling hash function h allows us to get h(K) from the preceding value h(K pr ) in O(1) time.Thus, obtaining the buckets for all the k-mers in R takes O(n) time.When we visit K in R , the probing mecha- nism of incval(H, K , 1, b pr ) will start to probe buckets from H[h(K)] until it finds one that either is empty or encodes a key matching K.The classical result on hash tables tells us that, by choosing a hash function h with collision probability 1/m, and setting the load factor of H to a constant value α , the number of probes to find a bucket for K is O(1) in expectation.Still, every time the probing mechanism visits an occupied bucket, it has to call the function keycomp to compare keys, which takes O(k) time.Thus, the call of incval(H, K , 1, b pr ) takes O(k) in expectation.Summing up, the complete running time of GetDict is O(nk) in expectation.H uses O(|V|) words of space as there is one bucket for each de Bruijn graph node v ∈ V , and there are O(m(1 − α)) empty buckets, m being the hash table size.The final aspect to consider is the space usage of B: there are at most u k-mers in R that (potentially) do not have a reference in H, that is, the kth prefix of every R i ∈ R , with i ∈ [1 . . .u] .Assuming uses ⌈log σ ⌉ = 3 bits per symbol, then each of these k-mers use ⌈3k/w⌉ words, and thus the total space for B is O(u k w ) words.As a conclusion, the total space usage of (H, B) is

Improving the time complexity
The running time O(nk) of Theorem 1 is a rather pessimistic upper bound for GetDict as invcal(H, K , 1, b pr ) does not always require k operations.The only case when In that case, we need to match K against the key in H [b ′ ] to be sure they are equal.We can express this situation in terms of de Bruijn graphs: the bucket H[b ′ ] encoding the node v labelled label(v) = K stores an incoming edge . In this case, we have available a preceding k-mer K pr = cgtt and its bucket b pr = 4 .Thus, we encode gtta as H[4] = (1, 4, a) .We continue with the remaining k-mers of R in the same way In an ideal scenario, where we know the set In reality, however, in the best case, we know one incoming node for v, i.e., the node u ′ encoded in the bucket H[b pr ] .Still, we can get closer to the ideal scenario by increasing the space of our data structure.
We keep an auxiliary hash table H e that stores indexes of H as keys.Each key in H e is associated with a list of up to σ − 2 = 3 integers.Thus, for a k-mer K encoded in the bucket H a new de Bruijn graph edge, and that cost was already covered.On the other hand, if K = K ′ , it means K and K ′ collide.Assuming the k-mers collide at random in h, the average number of symbols to determine that two random strings do not match is constant [1].Now, assuming GetDict found colliding k-mers q times during the scan of R , the total cost of failed k-mer decompression is O(q) on expectation.This argument gives us the final

Dictionary of canonical k-mers
We present our framework to compute the canonical dictionary D c k,R .We first describe how to adapt our compact hash table to the canonical setting ("Our canonical compact hash table" section).Then, we show how to extract keys in the new encoding ("Reconstructing a k-mer from our canonical encoding", and "Implementing our k-mer retrieval algorithm" sections), and the associated cost of the extraction ("Analysis of our k-mer retrieval algorithm" section).Subsequently, we present GetCanDict: our space-efficient algorithm that builds D c k,R using the canonical variant of our compact hash table ("Building the canonical dictionary" section).Finally, "Correctness of our canonical encoding" section explains the correctness of the output of GetCanDict.
We remark that the canonical framework we present here does not improve the asymptotic space usage of "Building the dictionary" section, but in practice, it reduces the number of hash table entries by one-half.Additionally, the ideas we present here do not consider the improvements of "Improving the time complexity" section.

Our canonical compact hash table
The main difference compared to our previous scheme is that now every k-mer K occurring as a key in H is the canonical sequence K c ∈ {K , K } .This strategy col- lapses K and K c into one single bucket and reduces H's overall size.However, it also invalidates our mechanism to spell the keys right to left as consecutive k-mers in R do not necessarily have canonical versions in the same DNA strand.In terms of the de Bruijn graph ("Connection with de Bruijn graphs" section), the reference bucket b pr of a k-mer K c now could be a incom- ing or outgoing node of K c .This change makes the retrieval of k-mers from H more difficult compared to the non-canonical variant because our original decoding method spells K c by following only incoming nodes (see "Building the dictionary" section).We will adapt our technique to overcome this problem, giving a solution that keeps the advantages of the canonical encoding at the expense of performing more computations in H.
Our canonical encoding is closely related to the way in which our algorithm GetCanDict works ("Building the canonical dictionary" section).However, we cannot explain that algorithm without first describing the new compact encoding.For the moment, it is enough to know that, during the execution of GetCanDict, we scan the reads in R left to right, and for each k-mer R[x . . .y] , we obtain its canonical sequence K c and insert it in some bucket H [b]. The reference bucket b pr we store pr we can use as reference, so we store K c explicitly in a buffer B.
A relevant concept in our new scheme is that of text overlap: be an occurrence of a string in {K pr , Kpr } , not necessar- ily the leftmost one.The text overlap of K c pr and K c is the overlap between the strings of {K pr , Kpr } and {K , K } that match R[x − 1 . . .y − 1] and R[x . . .y] , respectively.This overlap always matches a k − 1 suffix in {K pr , Kpr } with a k − 1 prefix in {K , K }.

The important observation about this definition is that the orientation
pr ⊕ K c like in the non-canonical version of H. There- fore, when we set the DNA orientation of H In general, there are four possible text overlaps for K c pr and K c , each with a specific DNA orientation in the spelling of K c . Figure 2 summarises the combinations and their outcomes.
The use of text overlaps to encode the keys requires a new format for H.In particular, the entry

Reconstructing a k-mer from our canonical encoding
Now that we have relaxed the orientation in which K c over- laps its predecessor K c pr , the reconstruction of K c might take more than k steps.We will slightly change the notation to explain this idea.Let S = b k ′ , . . ., b 2 , b 1 be the chain of reference buckets we visit in H to spell K c , with and K c 2 is equivalent to that of K c and K c pr .The general idea to reconstruct K c from H[b] is to scan S right to left, and for every bucket H [b j ] we visit, we extract the symbol from one of the ends of K c j and insert it into K c .We refer to this procedure as canspell: • canspell(b): returns the sequence K c from the input bucket H[b] of our compact hash table H.
In the non-canonical encoding of "Our compact hash table" section, we spell K c right to left as it holds In contrast, in the canonical encoding, the text overlaps of the k-mers in S do not induce a particular spelling direction.As we see in Fig. 2b, if and we retrieve a symbol from the right end of K c .On the other hand, if and we get a symbol from the left end of K c .Changes in the spell- ing direction can happen multiple times as we scan S and induce an inward reconstruction of K c : we obtain the end symbols of K c and advance to its centre.
We also need to consider the orientation of the k-mers in S relative to K c .In other words, for each K c j in S, we need to know the string K o j ∈ {K c j , K c j } overlapping K c according to the information in S. Before explaining this concept formally, we will define a function ror S that gives the relative orientation.This is its signature: • ror S (j) : returns 0 if K o j = K c j , and 1 if K o j = K c j .
We implement ror S using the following recursive function: In contrast, we move the window to the right: 3b shows an example of this idea.
The implementation of canspell(b) thus translates to extracting symbols from the distinct k-mers K o j we visit as we slide w ℓ , w r , stopping only when we have covered all the positions of K c .As mentioned before, this mech- anism reconstructs K c inwards.
We will keep two variables c ℓ = 1, c r = k that mark the inner ends of the reconstruction.Initially, the symbols within K c [c ℓ . . .c r ] are unknown, but we will obtain them as we slide w ℓ , w r .When we recover K c [c ℓ ] , we update the inner left end c ℓ = c ℓ + 1 , and when we obtain K [c r ] , we update the inner right end c r = c r − 1 .Notice that canspell will run as long as c ℓ ≤ c r . (1) Our canonical encoding only enables the extraction of the symbols K o j [1] and K o j [k] of each k-mer K o j (fields a and v in H), meaning that we cannot recover K c [c ℓ ] or K c [c r ] every time we visit a bucket in S. We consider the following two scenarios to extract symbols: It is also worth mentioning that when we reach a bucket b j in S whose key is in B, we have direct access to the full sequence of K o j .Therefore, we copy the corresponding area of K o j within K c [c ℓ . . .c r ] and finish canspell(b).This condition makes b j the last bucket in S.
We now show that canspell always returns an answer and that that answer is the correct sequence of K c .

Lemma 3 The execution of canspell(b) always finishes and returns a k-mer.
Proof We first demonstrate that S does not have loops b j , b j+x , . . ., b j+1 , b j .This type of structure would pro- duce that, after visiting b j+x , we return to b j and thus never end the reconstruction of K c .GetCanDict fills H keeping the following invariant: when we insert K c j in b j , the bucket H[b j+1 ] already encodes K c j+1 and H [b j ] is empty, so it is safe to store H[b j ].r = b j+1 .The invari- ant, in turn, induces the transitive property that all the buckets b j , b j+x , . . ., b j+1 of S already contained a k-mer when we inserted K c j .Further, H[b j ] was empty and K c j did not exist as a key up to that point.However, the loop We remark that the absence of cycles in S (Lemma 3) implies that not all the k-mers encoded by S overlap K c .Suppose that, starting from K o j , we slide the window x positions to the right and then x ′ < x positions to the left.When we return the window back to the left, the k-mers we visit are not the same as those we visited when sliding the window to the right, otherwise it would mean S has a cycle.If we combine this idea with the fact that the k-mers in S have transitive overlaps, we have that K o j does not have a suffix-prefix overlap with K o j+x+x ′ , but a substring match (see the vertical lines in Fig. 3b).However, this condition does not prevent us from reconstructing K c .

Lemma 4 canspell(b) returns the canonical k-mer K c encoded by the bucket H[b].
Proof We show that each k-mer K o j we obtain from S has a substring matching K c [c ℓ . . .c r ] at the moment we reach the bucket b j .We refer to this idea as the match- ing property.The validity of the matching property proves canspell outputs K c correctly because the algo- rithm obtains K c as we moved c r inwards after processing H [b j ] .Now assume that the next u buckets in S slide the window u positions to the right (i.e., we change the sliding direction).We distinguish three cases: is not a prefix or a suf- fix because j − u > 1 and k − 1 − u < k , and our encoding does not support direct access to this area of K o j+u .This situation means that we cannot shrink K c [c ℓ . . .c r ] as we visit K o j+u , or any in any of the k-mers x [1] in our encoding, we can retrieve K c [c ℓ ] and move c ℓ = c ℓ + 1 .Thus, the inner left end becomes c ℓ = c ℓ + u − j + 1 after visiting b j+u .
After consuming b j+u , . . ., b 2 , it might happen that the window changes direction again to the left.However, in this scenario, symmetrical conditions to those explained above apply.
We will present a formal implementation of canspell in the next section.

Implementing our k-mer retrieval algorithm
This section describes the practical aspects of implementing canspell(b) = K c .We present the pseudocode in Algorithm 2 and explain all the details below.
We begin (Lines 1-3) by initialising the variables c ℓ = 1, c r = k that mark the inner left and right ends of K c (respectively) in the inward reconstruction.We also define a bit q = ror S (j) that tells us the string K o j ∈ {K c j , K c j } overlapping K c .We set the initial value q = ror S (1) = 1 according to Eq. 1.
We continue canspell by traversing S right to left.Every time we reach a new bucket H [b j ] , we perform three steps: (i) Check if the key of H [b j ] is in the buffer B.
(ii) Check if the window w ℓ , w r crosses the inner ends of K c [c ℓ . . .c r ]. (iii) Slide the window in some direction and move to the next bucket Lines 5-13 show the process of step (i).Recall from "Our canonical compact hash table" section that we store k-mers explicitly in a dynamic buffer B whenever we do not have a predecessor we can reference in H, flagging the buckets in this situation with ε .Thus, when the tra- versal of S reaches a bucket b j such that r .The advantage of the buffer is that it contains all the information we need to complete the inner substring K c [c ℓ . . .c r ] (see Lemma 4).Therefore, we get the We fin- ish the execution of canspell by returning K c .On the other hand, if the key of H [b j ] is not in B, we move to step (ii).Lines 14-19 represent the work of step (ii).We start by computing the ends of K o j using the bit q.Specifically, if q = ror S (j) = 1 , we get We continue by checking if the window w ℓ , w r crosses an inner end of K c .If that is the case, we insert K o j [1] or K o j [k] in K c depending on which matching scenario holds (see Cases 1 and 2 at the end of "Reconstructing a k-mer from our canonical encoding" section).
Lines 20-30 depict the work we perform during step (iii).We first infer the direction (left or right) in which we slide the window.For that purpose, we rely on the text overlaps we presented in Fig. 2b.Recall that our encoding sets When computing the sliding direction, we also need to consider the orientation of K c j relative to K c .In particular, if ror S (j) = 0 , we invert the slide direction, otherwise we leave it unchanged.It is not difficult to see why we need to do this inversion: suppose H[b j ].r = b j+1 defines the overlap K c j+1 ⊕ K c j but q = ror S (j) .The overlap indicates that we need to slide the window to the left, but the bit in q indicates we need to flip the overlap to K c j ⊕ K c j+1 , which is equivalent to sliding the window to the right.

Analysis of our k-mer retrieval algorithm
We now analyse the cost of running canspell.We will see that its execution is exponential in the worst case, but our experiments showed that it is much faster in practice (see Fig. 9).We summarise the running time of canspell with this theorem:

in the bucket H[b] of our canonical compact hash table H.
Proof Since all buckets in S are distinct, they represent different canonical k-mers.There are at most O(σ k ) different canonical k-mers and thus canspell(b) visits at most O(σ k ) buckets.Processing each bucket requires constant time, and thus, the function canspell(b) has time complexity O(σ k ) .
Our proof of Theorem 5 demonstrates in a simple way that the running time of canspell is exponential in the worst case.However, the proof is incompatible with our proof of Lemma 4, which states that all the k-mers K o j encoded by S share a substring with K c .In the follow- ing, we will present an analysis that shows that we have an exponential upper bound for our k-mer retrieval algorithm even if all the K o j in S have a match with K c .We will assume canspell only extends the inner left end c ℓ of K c as it traverses the buckets of S, and maintains the invariant that c r = k .This assumption is only to simplify our explanations, and it does not affect the correctness of the reconstruction.
Let o be the number of symbols in K c we have already discovered at some point in the execution of canspell.As we fixed c r = k , it holds o = c ℓ .We will prove our the- oretical bound by studying, for each value of o ≤ k , the maximum number of buckets we could visit in S before moving c ℓ one position.We finish the construction of K c after moving c ℓ k times.
When o = 0 , the cost of moving c ℓ is O(1) as the first bucket b 1 in S is precisely the one encoding K c , and our compact encoding Now we study the case 0 < o ≤ k .Figure 4 depicts a graphical example of our argument.We consider two copies of the trie encoding the strings in o .We will refer to them as I and D, and will associate I with moving w ℓ , w r to the left and D with moving w ℓ , w r to the right.Let u h be a node at height h in any of these tries.We will use the notation u h , u h+1 to indicate we descend from u h to its child u h+1 , and u h+1 , u h to indicate we climb from u h+1 to its parent u h .
Let u h ∈ I be a node at height h and let v o−h ∈ D be a node at height o − h .The operator label I (u h ) returns the string spelled by the path u h , u h−1 , . . ., u 1 from u h to the root u 1 of I, while label D (v) is the string spelled by the path v 1 , v 2 , . . ., v o−h starting at the root v 1 of D and ending at v o−h .The string label I (u forms one of the k-mers encoded by S before moving the inner end c ℓ .Although different nodes u o ∈ I and v o−h ∈ D form different k-mers, Lemma 3 limits the node combinations that form k-mers we could see in S. We will explain this idea below.
Moving w ℓ , w r to the left by x ≤ o − h positions trans- lates to descend a path u h+1 , u h+2 , . . ., u h+x in the sub- tree of I rooted at u h , at the same time it means climbing the Moving the window to the right represents the opposite operation: climbing the ancestors of u h+x and descend- ing a path from the subtree of D rooted at v o−h−x .How- ever, when we move to the right, we cannot descend the same path v o−h−x+1 , v o−h−x+2 , . . ., v o−h we climbed in D before because it would mean visiting k-mers of S more than once, which contradicts Lemma 3. Symmetrically, once we choose a path in D to move to the right, if we require to move to the left again, we can not descend u h+1 , u h+2 , . . ., u h+x in I.
In conclusion, sliding w ℓ , w r back and forward over a node in I or D cancels branches, reducing the formation of k-mers that we could see in S. Figure 3B depicts this situation with vertical lines.The cancellations produce each internal node v in I or D to be associated with σ − 1 different k-mers in S as we can descend only once through each of v's children and then climb back to v. We use the remaining child to slide the window in the same direction we did before, thus keeping the invariant of Lemma 3. On the other hand, every leaf in the tries is associated with one k-mer of S. Once we exhausted all the possible visits in all the nodes in I and D, we do not have any other choice but to slide w ℓ , w r to the right and move c ℓ .
The trie with the strings in o is a full k-ary tree of degree σ , meaning it has σ o leaves and σ o − 1/σ − 1 internal nodes.If we add the cost of the σ − 1 possible visits to an internal node, we obtain a cost of σ o − 1 for processing the internal nodes and a total cost of 2σ o − 1 for processing the full trie.Now let us assume we moved the inner end c ℓ one position, so now we know o + 1 symbols of K c .Let I o , D o be the tries we used for o and let I o+1 , D o+1 be the new tries we will use for o + 1 .We know that I o is a subtree of I o+1 (respectively, D o of D i+1 ) that we already traversed when the number of known symbols of K c was o.Therefore, if at some point in the synchronized traversal of I o+1 and D o+1 , we visit two nodes u h and v o+1−h such that u h belongs to the subtree I o and v o+1−h belongs to D o , then we would form a k-mer we already visited, thus breaking Lemma 3. Therefore, the cost of moving c ℓ when o + 1 becomes (2σ o+1 − 1) − (2σ o − 1) as we discard the subtrees I o and D o of visited k-mers.Further, if we consider all the possible values for o, we obtain Equation 2 is a sum that telescopes to 2σ k − 1 , so we obtain the running time O(σ k ) for canspell.
The analysis we presented here is rather pessimistic as having full tries I and D implies that the input read collection R produces a complete de Bruijn graph with order k, which is hardly the case in practical scenarios.

Building the canonical dictionary
Our algorithm GetCanDict constructs the canonical dictionary D c R,k using our compact data structure of "Our canonical compact hash table" section.The idea is to scan R left to right, and for each k-mer K = R[x . . .y] , we compute its canonical K c ∈ {K , K } and insert K c into H using the operation incval (see "Building the dictionary" section).The most relevant change of GetCan-Dict compared to GetDict is the implementation of keycomp(H, b ′ , K c , b pr ) , the subroutine that incval calls We start the canonical variant of keycomp by comparing the tuple We also introduce a small modification to incval to maintain the correctness of the dictionary.In the noncanonical scheme, when we call incval with a k-mer K that currently is in B, but now we have a predecessor bucket b pr for it, we remove K from B and update the predecessor reference in H [b].r = b pr (see Line 24 in Algorithm 4.2.2).In the canonical variant of incval, we do something similar: we remove a canonical sequence K c from B if we now know a bucket b pr we can use for its reconstruction.However, we also need to ensure that all k-mers in H can be reconstructed after this update.This invariant can be violated if the reconstruction of the key K c

Correctness of our canonical encoding
We will show that the canonical encoding of H and the way GetCanDict works do not prevent the correct construction of the k-mers in D c k,R .The only important aspect to demonstrate is that keycomp always returns the correct answer.

Reporting the k-mers in the compact dictionary
Our framework also implements an algorithm to report the k-mers of a compact hash table H in uncompressed form.We refer to this procedure as DumpDict.It works by following the reference chains to reconstruct the k-mers and write them into an output file along with their frequencies, provided the frequency is above some input threshold τ ."Non-canonical scheme" section explains how DumpDict works when H follows the non-canonical scheme, and "Canonical scheme" section describes how it works when H follows the canonical scheme.

Non-canonical scheme
In the non-canonical version of H, the process is simple: we scan H left to right until we find the leftmost occupied bucket H being the only bucket that does not have a refer- ence (i.e., K k ′ is the dynamic buffer B).We also assume that S i has length |S i | = k ′ ≥ k , meaning that the chain encodes one or more k-mers together.We take the  [1] to get K 2 .We then check its frequency in H[H [i].r] and decide whether to print it or not using τ .We continue traversing S i until reaching K k ′ , or until we encounter a k-mer that has already been printed.To do this, when a bucket is visited, we mark it so we do not later try to print the same k-mer again.After we process all the k-mers in S i , we move to the next occupied bucket H [i ′ ] , with i ′ > i , in the scan of H, and start to process the corresponding reference chain S i ′ , unless H [i ′ ] is marked as processed.In this case, we move to the next occupied bucket and continue with the same idea until we finish the scan of H.
We need the k-mer reference chains to be as long as possible to make the writing process more efficient.The reason is that we need to visit k buckets to reconstruct the first k-mer in a reference chain S i , but all subsequent k-mers in S i can be reconstructed by visiting one new bucket.So, instead of starting the writing process at the leftmost occupied bucket H[i], we can scan H once and mark all buckets that are referenced by another k-mer.Now, all that are not marked are either empty or contain a k-mer that is not referenced by another k-mer.We can then start writing k-mers from these unmarked buckets to maximise the reference chain lengths and reduce the time needed to print the k-mers.

Canonical scheme
The k-mer writing process in the canonical variant of H is more involved but follows the same idea we described in "Non-canonical scheme" section.The first k-mer K c 1 in a reference chain 1 that does not belong to K c 2 , removing it, and adding the missing DNA symbol from H[b j+1 ] to W. The reconstruction of K c 1 from H[b 1 ] might require visiting more than k buckets as we call canspell (the reason is in "Implementing our k-mer retrieval algorithm" section).However, after we get K c 1 , we obtain the subsequent k-mers from b k ′ , . . ., b 2 by vis- iting only one new bucket for each.

Implementation details
We implemented2 GetCanDict ("Building the canonical dictionary" section) and the variant of DumpDict that processes the canonical compact hash table ("Canonical scheme" section) in C++.We refer to this software as Kaarme.We did not implement GetDict and the variant of DumpDict that deals with the non-canonical hash table because most genomic analyses only use the canonical dictionary.Our source code implements the function incval in GetCanDict using compare and swap (CAS) atomic instructions to record the k-mers of R in paral- lel in a lock-free manner.To make the procedure more space efficient, we included a filtering step so GetCan-Dict can ignore most of the k-mers that do not appear at least twice.More specifically, we use two Bloom filters [3] where the first Bloom filter includes all k-mers occurring at least once in the data set, and the second Bloom filter includes all k-mers occurring at least twice in the data set.Thus, when first encountering a k-mer, we add it to the first Bloom filter.If a k-mer is already found in the first Bloom filter, we add it to the second Bloom filter.
Only k-mers that are found in the second Bloom filter are added to the hash table of Kaarme.Because Bloom filters allow false positives, some k-mers with a single occurrence can be inserted into the hash table, but these are easily filtered out in the end when reporting the k-mers.

Competitor tools
We compared our software (Kaarme) against the following methods: 1. Plain: a multi-threaded k-mer counter that uses a generic lock-free hash table implemented by us.The hash table stores the full k-mer sequences in a twobits-per-symbol format, along with the frequencies.2. Jellyfish [18]: a k-mer counter using a multithreaded lock-free hash table.3. CHTKC [31]: a semi-external k-mer counter.When the hash table is full, CHTKC stores all subsequent new k-mers on disk to be processed in a later batch.4. DSK [26]: a disk-based k-mer counter that partitions the input and stores the partitions on disk. 5. Gerbil [9]: a k-mer counter with GPU support designed to efficiently count k-mers for large k.
We implemented Plain as a module within Kaarme.
We use the flag -m to tell our software to either use our compact hash table scheme or Plain.Additionally, Plain and Kaarme require the user to estimate the number of distinct k-mers in the data set for them to calculate the bloom filter size.We computed the estimate by running DSK on the datasets to obtain the number of distinct k-mers.Jellyfish also requires an estimate of the number of distinct k-mers, so we gave it the value reported by DSK.CHTKC requires the user to define the maximum amount of memory it is allowed to use.We set this value to 15 GB, close to the maximum available memory of the used machine.

Datasets
We used three read collections for the experiments: 1. ecoli280: 280x coverage PacBio HiFi Escherichia coli reads (acc: SRR10971019).2. ecoli100: 100x coverage downsampled version of ecoli280.(The number of k-mers in the hash table is close to the number of reported k-mers.Few with count = 1 slip through the bloom filter.) internal timer was used.The run time and memory usage is shown in Fig. 5. Missing results indicate that a tool could not count the k-mers with the available amount of memory.
To illustrate the difference in memory usage for different values of k, statistics of Kaarme memory usage per distinct k-mer in the hash table can be seen in Fig. 7. To show how much time each procedure of Kaarme takes, in Fig. 8, the run times are split into three different parts: Bloom filtering, counting, and dumping.Figure 6 shows the estimated memory usages of the main Kaarme data structures.Note that the Bloom filter size includes both Bloom filters, which is halved by deleting the first bloom filter before we proceed to the k-mer counting step.
Finally, we measured the average number of buckets visited when decompressing a k-mer in the canonical dictionary built on the ecoli100 data set.We first constructed the canonical dictionary D c k,R .Then, k-mers on the occupied entries of the hash table were decompressed, and the number of visited buckets during decompression was recorded.The average lengths of these k-mer decompression chain lengths are shown in Fig. 9.
The experiments were run on a laptop with 16 GB of RAM, Intel Core TM i5-8250U CPU @ 1.60GHz × 8 pro- cessor, and a 64-bit Linux-based OS.

Results and discussion
First, we compare Kaarme to Plain. Figure 5 shows that Kaarme uses significantly less memory than Plain.On ecoli100 with k = 51 , the space usage of Kaarme is about 70% of the space usage of Plain (362MB vs 529MB), and the difference grows as the value of k increases.On ecoli280, the difference is even more significant with larger values of k.On dmel20 with k = 51 , Kaarme also uses about 40% less memory than Plain, and Kaarme could run with k up to 301 while Plain ran out of RAM already when k was set to 151.However, the reduced space usage does not come completely without a cost.The runtime of Kaarme is longer compared to Plain (for example, 151 s vs 122 s with k = 51 on ecoli100).
In all the datasets, the space usage of Kaarme was dominated by the hash table and Bloom filters, while the secondary buffer took on average less than 11% of the total space usage (see Fig. 6).We remark that our hash table's compact encoding uses a constant number of bits per k-mer (regardless of k), and that our experiments showed that the secondary buffer contributes little to the memory peak.Therefore, we expect the space usage per distinct k-mer to grow very slowly with Kaarme, and to grow linearly with k for Plain.This is indeed the case as shown in Fig. 7. On the other hand, Fig. 8 shows that the running time of Kaarme is dominated by the counting phase, especially when k grows.
We see in Fig. 5 that, when compared to other k-mer counters, Kaarme uses the least amount of memory in all other experiments except on dmel20, where Gerbil is more memory efficient with k < 300 .However, Kaarme is slower than the other k-mer counters with the exception of Jellyfish on some data sets where its memory usage is close to the total RAM available on the machine.
We remark that Kaarme only implements compact inmemory hash tables that are suitable for k-mers, while our competitors are full-fledged counters that combine hash tables with other techniques.Thus, a comparison of Kaarme against these tools is not completely fair.This observation is particularly true for Gerbil, CHTKC, and DSK that rely on disk to reduce RAM usage.
Gerbil, DSK, and CHTKC control the amount of main memory they use, so they can count the k-mers in all the data sets without running out of RAM.CHTKC and DSK used RAM up to the set limit of 15GB but made exhaustive use of disk when it was deemed necessary.Because of the disk usage, the comparison between all the programs is not strictly fair.Still, the space usage of Kaarme was usually the smallest, excluding lower k experiments on dmel20, indicating that Kaarme is the most memory frugal.

Concluding remarks
We have presented Kaarme, a space-efficient hash table to count large k-mers in memory.We showed that Kaarme uses up to five times less space than a regular hash table for counting k-mers while being at most 1.5 times slower.When compared to k-mer counters, Kaarme uses the least amount of memory when k is large.
We note that both DSK and CHTKC make use of hash tables in their implementation.Thus, the adaption of Kaarme as a submodule in these tools could allow them to either use less memory or count larger k-mer sets in memory.However, Kaarme takes advantage of the fact that most of the k-mers overlap by k − 1 symbols with a previous k-mer in the input collection.Depending on how DSK partitions the k-mers, the input data set could become more fragmented with a much larger amount of k-mers without predecessors, causing the secondary buffer B to grow significantly.The same could be true for CHTKC, which writes the excess k-mers into disk for the next iteration.
The reconstruction of k-mers in the canonical dictionary of Kaarme can be exponential, but our experiments suggest that, on average, the time complexity seems to be close to linear.Therefore, Kaarme is a practical, space-efficient hash table for large k-mers.
We start the execution by checking if H[b ′ ].r = b pr and K [k] = H[b ′ ].a .When these conditions hold, we return true immediately because the key of H[b pr ] is suffixed by K [1 . . .k − 1] .Notice we can finish the call without further symbol comparisons because incval is a subroutine of GetDict , and this function scans R from left to right.Therefore, it always hold that the bucket H[b pr ] encodes a k-mer K pr that immediately precedes K in R .On the other hand, if b pr = H [b ′ ].r , we start the decompression of H[b ′ ] 's key.We first compare H[b ′ ].a against K[k] and set the next bucket b ′ = H[b ′ ].r if they match.As a general rule, in every ith step, we compare H[b ′ ].a against K [k − i + 1] and return false if they differ or go to the next bucket b ′ = H[b ′ ].r and per- form another symbol comparison.When every symbol K [k − i + 1] matched its corresponding symbol H[b ′ ].a , with i ∈ [1 . . .k] , we can be sure that H[b ′ ] encodes K, so keycomp(H, b ′ , K , b pr ) returns true.The only exception to the procedure of keycomp is when we reach a bucket H[b ′ ] whose key is in B. We can easily detect this situa- tion because H[b ′ ].a = ǫ is a special symbol.When this happens, we get the buffer offset l = H[b ′ ].r and compare the k

Theorem 1
Let G = (V , E) be the de Bruijn graph of order k built from a collection R = {R 1 , . . ., R u } of u strings and ||R|| = n symbols, V being the set of nodes and E the set of edges.GetDict(R, k) requires O(|V | + u k w ) words of space to encode (H, B) and runs in O(nk) expected time.

Fig. 1
Fig. 1 Example of our compact hash table H filled by GetDict using the 4-length k-mers of the read R = cgttagttaa .The arrows indicate the references where we recover the (k − 1) -prefixes of the k-mers.We first map R[1 . . .k] = cgtt to its bucket H[h(cgtt) = 4] .However, as we do not know a reference bucket H[b pr ] to recover the prefix R[1 . . .k − 1] = cgt , we store the k-mer's full sequence in the dynamic buffer as B[l = 1 ... l + k − 1 = 4] = cgtt and store H[4] = (f = 1, r = l, a = ε) .The next k-mer in R is R[2 . . .k + 1] = gtta , whose designated bucket is H[h(gtta) = 7].In this case, we have available a preceding k-mer K pr = cgtt and its bucket b pr = 4 .Thus, we encode gtta as H[4] = (1, 4, a) .We continue with the remaining k-mers of R in the same way [b], the list L = value(H e , b) contains the buckets in I v that are different from H[b].r.We will also add a new field to the buckets of H.The new field H[b].d ≥ 0 will store the number of probing steps incval incurred to reach b from h(K) when inserted K the first time.For example, if probe(h(K ), m, 3) = b , then H[b].d = 3 .We also change the signature of keycomp to keycomp(H, b ′ , d, K , b pr ) , with probe(h, K , d) = b ′ .In other words, d is the number of probing steps incval performed to reach the bucket H [b ′ ] from H[h(K)].We remark that we call keycomp during the execution of incval, so we always know d when we call keycomp.We implement keycomp(H , b ′ , d, K , b pr ) as follows: we start by checking that H [b ′ ].a = K [k] , and return false if they differ.Now suppose H [b ′ ].a = K [k] .If H [b ′ ].r � = b pr , we access the list L = value(H e , b ′ ) and check if one of the buckets in L matches b pr .If that is the case, we return true.On the other hand, if b pr / ∈ {L ∪ H[b ′ ].r} ; and |L| + 2 = σ or H [b ′ ].d � = d , we return false.When H[b ′ ].d = d and |L| + 2 < σ , we compare K against the key in H[b ′ ] as usual.When they match, we store b ′ in L and return true, otherwise we return false.Theorem 2 Let G = (V , E) be the de Bruijn graph with order k built from the collection R of u strings over the constant alphabet of size 4, and with a total of ||R|| = n symbols.An instance of GetDict(R, k) that uses the encoding (H, B, H e ) requires O(|E| + u k w ) words of space and runs in O(n + (|E| − |V |)k + q) expected time, where q is the total number of times GetDict finds colliding k-mers in H. Proof The table H e uses O(|E| − |V |) words of space as it only contains the missing edges of G that are not in H. Thus, the combined space of H and H e is O(|E|) words.If we also consider the buffer B, the final space usage of our compact representation is O(|E| + u k w ) words.In our new encoding, the function keycomp will fully decompress the key of a bucket H[b ′ ] in one spe- cific case: when H[b ′ ] encodes K, but the reference b pr is not in {L, H[b ′ ].r} .This situation is equivalent to dis- covering a new incoming edge for the node v labelled label(K).GetDict discovers indegree(v) − 1 edges this way because the remaining edge is stored in H[b ′ ].r when the algorithm inserts K into H[b ′ ] .Thus, the total cost of counting the f occurrences of K in R (without considering collisions) is f + (indegree(v) − 1)k .If we consider all the nodes of G, the total cost of counting without collisions is O(n + (|E| − |V |)k) expected time.Now let us consider the collisions.The purpose of the field d is to ensure that we decompress a k-mer

Fig. 2 A
Fig. 2 A Example of text overlap.The substring R[x − 1 . . .y] = atgcc encodes two k-mers, R[x − 1 . . .y − 1] = atgc and R[x . . .y] = tgcc .Let us assume the canonical K c pr ∈ {K pr , Kpr } matches the DNA reverse complement of R[x − 1 . . .y − 1] , i.e., gcat .On the other hand, let us assume R[x . . .y] = K c matches the canonical of {K , K } .Thus, the relative DNA orientation of R[x − 1 ... y − 1] ⊕ R[x . . .y] with respect to K c is K c pr ⊕ K c .This means that K c [k − 1] = π(K c pr [1]) = cis the symbol we obtain from the link H[b].r in K c 's bucket H[b].The grey arrow from H to the grey rectangle depicts this situation.B Text overlaps for K c pr and K c relative to K c 's DNA orientation.The x marks the symbol in K c we obtain by following H[b].r Fig. 2 A Example of text overlap.The substring R[x − 1 . . .y] = atgcc encodes two k-mers, R[x − 1 . . .y − 1] = atgc and R[x . . .y] = tgcc .Let us assume the canonical K c pr ∈ {K pr , Kpr } matches the DNA reverse complement of R[x − 1 . . .y − 1] , i.e., gcat .On the other hand, let us assume R[x . . .y] = K c matches the canonical of {K , K } .Thus, the relative DNA orientation of R[x − 1 ... y − 1] ⊕ R[x . . .y] with respect to K c is K c pr ⊕ K c .This means that K c [k − 1] = π(K c pr [1]) = cis the symbol we obtain from the link H[b].r in K c 's bucket H[b].The grey arrow from H to the grey rectangle depicts this situation.B Text overlaps for K c pr and K c relative to K c 's DNA orientation.The x marks the symbol in K c we obtain by following H[b].r Now let us assume without loss of generality that, when we visit H[b j ] in the scan of S, ror S (j) returns 0. Also assume that the information in H[b j ].e and H[b j ].o tells us the link H[b j ].r = b j+1 represents K c j+1 ⊕ K c j .We notice that the overlap H[b j ].r = b j+1 is inconsistent with ror S (j) = 0 because K o j = K c j is the reverse complement of the string we use in K c j+1 ⊕ K c j .We fix this problem by flipping the DNA orientation of the overlap encoded by H[b j ].r to obtain K c j ⊕ K c j+1 , and now the two k-mers are oriented with respect to K c .However, this strand flip has a chain effect because it makes K o j+1 equal to K c j+1 (i.e., ror S (j + 1) = 0 ), and depending on H[b j+1 ].e and H[b j+1 ].o , we might need to flip H[b j+1 ].r = b j+2 as well.We will generally continue flipping k-mers in S until we reach a bucket b j ′ where the text overlap is consistent with ror S (j ′ ).Considering all this information, we can fairly say that traversing S right to left resembles sliding a window over K c back and forward as we visit the buckets.Initially, the window is set to w ℓ = 1, w r = k when we are in H[b 1 = b] .Then, when we reach H[b j ] , we move the window to the left: w

Fig. 3 A
Fig. 3 A Spelling K c = K c 1 = atgat from H[b 1 ] .The white arrow to the left indicates that the figure is read bottom-up.Each jth circle is the bucket H[b j ] in the reference chain S. The black string is K c j and the grey string is K c j .The incomplete string next to each circle has the symbols of K c we know up to that bucket.The green symbol is the one we extract from K o j and insert it in one of the inner ends of K c .The arrow from H[b j ] to H[b j+1 ] indicates the text overlap of H[b j ].r = b j+1 .The red line is the k − 1 prefix in H[b j ] that matches a k − 1 suffix in H[b j+1 ] (blue line).B The k-mers K o j we use in the spell of K c .When we change the spelling direction from H[b j+1 ] to H[b j+2 ] , K o j+2 does not match K o j because the chain of buckets S = b 9 , . . ., b 2 , b 1 spelling K c cannot have repeated elements (seeLemma 3).We mark the mismatching symbols of K o j and K o j+2 with vertical lines in the figure.On the other hand, we remark that changes in the spelling direction are induced by the order in which we insert the k-mers in H and the reference bucket we have available at the moment of the insertion [c ℓ ] or K c [c r ] from the substring of K o j matching K o [c ℓ . . .c r ] .Our proof below uses an arbitrary sequence of slides for w ℓ , w r , but we can give a symmetric argument for other sliding sequences.When we start canspell(b), the matching property is trivially true as c ℓ = 1, c r = k and the bucket H[b 1 = b] encodes precisely the k-mer K o 1 = K c .Now assume the sequence b j , b j−1 , . . .b 1 in S slide the window j − 1 < k positions to the left, so the right boundary now is r = c r − j + 1 .The matching property still holds as the when sliding the window to the right.If we consider this match and K o j [j . . .k − 1] = K c [c ℓ . . .c r ] , then by the transitive overlaps K o j ⊕ K o j+1 • • •K o j+u when sliding the window to the right, we get the match K K o j+u and K c have the same sliding window position w ℓ = 1, w r = k , but they have different sequences due to Lemma 3.However, the substring of K o j+u matching K c [c ℓ . . .c r ] is a prefix and we have access to K o j+u [1] in our encoding.Therefore, we extract K c [c ℓ ] and move c ℓ one position inwards c ℓ = c ℓ + 1. • j ≤ u : for the buckets b j+1 , . . ., b j+j−1 the previ- ous cases apply.The remaining buckets b 2j , . . ., b j+u move the inner left end c ℓ by one position each because the matches induced by the transitive overlaps K o 2j ⊕ K o 2j+1 • • •K o j+u when sliding the window to the right go in the same direction as we move c ℓ .For every K o x , with x ∈ [2j . . .j + u] , we have K o x [1 . . .c r − c ℓ + 1] = K c [c ℓ . . .c r ] , and because we have access to K o r = b j+1 uses K c j for the text overlap between K c j+1 and K c j , and 0 if it uses K c j .Equivalently, H[b j ].e = 1 means the link uses K c j+1 for the overlap, and H[b j ].e = 0 means K c j+1 .Thus, we slide the window to the left if H[b j ].o = 1 , and to the right if H[b j ].o = 0.

Example 1
Finally, we move to the next bucket H[b j+1 ] and repeat the same three steps above.The reconstruction of K c finishes when c ℓ > c r , which means we already cover all the symbols of H[b]'s key. Figure 3 and Example 1 show how we reconstruct K c in the canonical encoding.Spelling of K c = K c 1 = atgat from H[b 1 ] in Fig. 3.We initialise the sliding window w ℓ = 1, w r = k and the inner ends c ℓ = 2, c r = 4 of K c as H [b 1 ] already stores K c [1] and K c [k] .Then, the fields e = 1, o = 0 in H[b 1 ] indicate the overlap K c 1 ⊕ K c 2 , which means sliding the window to the right w ℓ = 2, w r = 6 .Besides, as q = ror S (1) = 1 , we do not invert the sliding direc- tion.However, as H [b 1 ].e � = H [b 1 ].o , q becomes ror S (2) = ¬ror S (1) = 0 .When we visit H [b 2 ] , w ℓ matches c ℓ = 2 , so we set K c [c ℓ = 2] = K o 2 [1] = π(K c 2 [k]) and move K c 's inner left end to c ℓ = 3 .The fields e = 1, o = 0 in H[b 2 ] indicate K c 2 ⊕ K c 3 , which means sliding the window to the right, but as K o 2 = K c 2 ( q = 0 ), we invert the direction and slide the window to the left w ℓ = 1, w r = 5 instead.Further, q = ¬ror S (2) = ror S (3) becomes 1 as H[b 2 ].e � = H[b 2 ].o .When we visit H [b 3 ] , the window does not match the inner ends c ℓ = 3, c r = 4 , so we do not recover any symbol.Additionally, H [b 3 ].e = 1 and H[b 3 ].o= 1 indicate K c 4 ⊕ K c 3 , and because q = 1 , we slide the window to the left w ℓ = 0, w r = 4 .The bit q = ror(3) = ror(4) remains the same as H[b 3 ].e = H[b 3 ].o .In H [b 4 ] , it holds c r = w r = 4 , so we get K c [c r = 4] = K o 4 [5] and set c r = 3 .The win- dow does not match the inner ends c ℓ = 3, c r = 3 in buckets b 5 , b 6 , b 7 , and b 8 , so we do not get any symbol.The window in H [b 9 ] is w ℓ = 3, w r = 7 , and because w ℓ = c ℓ = 3 , we get K c [c ℓ = 3] = K o 9 [1] = π(K c 9 [k]) , set c ℓ = 4 > c r = 3 , and we are done.

Fig.
Fig. Demonstration of the running time of canspell.A The current bucket b j in S's traversal stores K c j = label I (u h )•K c [c ℓ . . .c r ]•label D (v o−h ) (dotted red line).The triangles are the tries I and D. B Sliding w ℓ , w r x = 1 position to the left, and then x = 1 position to the right (i.e., visiting the next two buckets in S).The new node v o−h is different from that of A because of Lemma 3. We already traversed the blue path and can not traverse it again due to Lemma 3. C, D The remaining window sliding options visiting u h .After (D), we can only slide w ℓ , w r to the right without breaking Lemma 3. Therefore, we visited u h σ − 1 = 3 times, with σ being the trie's degree

(
2σ o − 1) − (2σ o−1 − 1)to compare k-mers while probing buckets in the compact hash table (see Algorithm 4.2.2).The function keycomp(H , b ′ , K c , b pr ) receives a bucket H[b ′ ] and an input k-mer K c , and returns true if K c matches the key in H [b ′ ] , and false otherwise.The param- eter b pr is a bucket H [b pr ] encoding K c pr .In the canonical variant of keycomp, we will add two extra parameters e ′ and o ′ .Let R[x . . .y] be the substring in R where we obtain the occurrence of K c we are querying in keycomp, and let R[x − 1 . . .y − 1] the occurrence of the predeces- sor K c pr associated with H [b pr ] .The field o ′ tells the rela- tive orientation of K c pr and R[x − 1 . . .y − 1] and the field e ′ tells the relative orientation of K c and R[x . . .y] .The meaning of the values for e ′ and o ′ is the same as those for H[b ′ ].e and H[b ′ ].o.
H [b ′ ].o ).If they are equal, we conclude that K c equals the k-mer in H [b ′ ] , so we return true.If they differ, we need to reconstruct H [b ′ ] 's key to check if it matches K c .This process is almost the same as running canspell on H[b ′ ] (Algorithm 2).The only difference is that every time we extract symbols for H [b ′ ] 's key (Lines 10, 12,18, or 19), we compare them against their corresponding positions in K c , and if they differ, we return false.If all the symbols in H[b ′ ] 's key matched their corresponding posi- tions in K c , we return true.
pr in H [b pr ] depends on the reconstruction of K c .In this situation, setting H [b].r = b pr creates a cycle in the chain S spelling K c pr from H [b pr ] , thus invalidating Lemma 3. We can easily check this condition by calling keycomp on bucket b pr and checking if the chain of refer- ences includes K c .If this is the case, we keep K c in B to ensure that all k-mers in H are reconstructible.

Lemma 6
Let K c ∈ {K , K } be a canonical k-mer encoded in H[b] with an occurrence K = R i [x . . .y] ∈ R .Additionally, let H[b pr ] be the bucket encoding the canon- ical form of K pr = R i [x − 1 . . .y − 1] .Given an arbitrary non-empty bucket H[b ′ ] , keycomp(H, b ′ , K c , b pr , e ′ , o ′ ) will always stop and return true or false.Proof At the beginning of GetCanDict, H is empty and thus trivially all k-mers can be reconstructed.Let us assume that before we insert a new k-mer K c into H, all k-mers already in H can be reconstructed.When inserting K c , we will either (i) add K c to the dynamic buffer B and store a pointer for B[l . . .l + k − 1] = K c in H[b].r = l or (ii) add the first and last symbols of K c to H[b] together with the bucket H[b].r = b pr of K c pr .In case (i), K c clearly can be reconstructed as we only need to access B using H[b].r= l .In case (ii), we immediately know the first and the last symbols of K c .Furthermore, we have the bucket H[b pr ] storing the k-mer K c pr .By definition, we know that K c overlaps by k − 1 symbol one of the strings {K pr , Kpr } from which K c pr was obtained (bits H[b].o and H[b].e tell us which string, K pr or Kpr , is the one that K c overlaps).Since K c pr is already in H, it can be reconstructed, and thus we can uncover the remaining k − 2 symbols of K c .
symbol in H[b 1 ].a and store it in the rightmost posi- tion W[k] of a buffer W [1 . . .k] .Then, we follow the link H[b 1 ].r = b 2 , retrieve the DNA symbol in H [b 2 ] , and add it to W [k − 1] .We keep applying this idea until we have k symbols in W. Notice we fill W from right to left because the DNA symbols we store in H are in the right end of their k-mers.Once W is full, we are ready to print the kmer K 1 , the one stored in H [i = b 1 ] .We find its count in H[i].f, and if it is above τ , we write W and its frequency in the output file.Then, we follow the reference chain further H[b k ].r = b k+1 .Since W is full, we have to drop the rightmost symbol W[k], shift the buffer by one position to the right and insert the DNA symbol of H [b k+1 ] in W

Fig. 5 Fig. 6 Fig. 7
Fig. 5 Memory usage (left column) and runtime (right column) of the tools on the ecoli100 (top row), ecoli280 (middle row) and dmel20 (bottom row) data sets.Missing columns indicate that k-mers could not be counted using the program

table H
entry (K, f) of key K and frequency f is represented as a string in k and an integer, respectively.Our theoretical descriptions in the following sections assume a suitable size m for H is known prior to the execution of GetDict or GetCanDict.A suitable m is big enough to encode the distinct k-mers of R .Let H be a hash table that uses open addressing to resolve collisions and h a hash function that maps k-mers to buckets in H.We define the following operations: 1. incval(H, K, f): increments the value associated with the key K in H by f or stores a new entry (K, f) in H if K does not exist as key. 2. value(H, K): returns the value associated with the key K in H. 3. probe(h(K), m, d): receives as input the fingerprint h(K) of a k-mer K ("String fingerprints and rolling hashing" section) and returns the hash table bucket in [1 . . .m] that the probing function of the open addressing scheme produces in step d.
The problem arises because we do not know a bucket H[b pr ] encoding a k- mer K pr with R i [1 . . .k − 1] = K pr [2 . . .k] that we can record as a reference in H[b].r.Thus, we get the leftmost available block B[l . . .l + k − 1] , copy K there, and set H[b] = (1, l, ε) , where ε serves as a flag that indicates that K is in the dynamic buffer.We say that B is dynamic because, as soon as we find a bucket H[b pr ] , we remove K from B and store H[b].r= b pr instead.
pr ) : receives as input the hash table H, a bucket b ′ , a k-mer K ∈ * , and a (possible null) bucket b pr whose triplet H [b pr ] encodes a k-mer K pr = c•K [1 . . .k − 1] , with c ∈ , and returns true if the key in H[b ′ ] matches K or false otherwise ′ ] such that keycomp(H, b ′ , K , b pr ) is true, it means K already exist in H, so we just increment H[b ′ ].f by one.Additionally, if H[b ′ ].a = ε and b pr is not null, we